suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(plotly))
1. Drop Oceania Before dropping the unused data the country and continent have 142 and 5 levels repectively. In addition, the row number before dropping is 1704.
#before dropping Oceania, the number of levels of continent is 5.
nrow(gapminder)
## [1] 1704
Ocean_drop <- gapminder %>%
filter(continent != "Oceania")
str(Ocean_drop)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
nrow(Ocean_drop) # Number of rows after dropping levels
## [1] 1680
After filting out the Oceania, the row number drop to 1680. By appling fct_drop on continent, only the level of continent drop to 4 by removing Oceania. And while droplevels() was applied, it applied to all the factors in the structure, therefore, the level of country drop to 140 which mean the country in Oceania were removed. And the level of continent is also reduced to 4.
# dropping unused Oceania data
drop1 <- Ocean_drop %>%
mutate(continent=fct_drop(continent))
str(drop1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
nrow(drop1) # the row number remaining at 1680
## [1] 1680
drop2 <- Ocean_drop %>%
droplevels()
str(drop2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
nrow(drop2)# Number of rows after dropping levels
## [1] 1680
2. Reorder the levels of country or continent The continent is originally ordered alphabetically, here we used different method to reorder the levels of the continient. First, we will check the original looking of the gapminder dataset before reordering. It is obviously that the continent is ordered by alphabetically.
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Now we try to reordered the levels of the continent in the following different ways:
gapminder %>%
mutate(continent=fct_reorder(continent,gdpPercap,max)) %>%
ggplot(aes(continent,gdpPercap)) + geom_boxplot(aes(fill=continent))
# to verify that this is reordered using maximum GDP per capital of each continent, we do the following:
gapminder %>%
group_by(continent) %>%
# calculate the maximum gdpPercap in each continent
summarize(
max_gdp = max(gdpPercap)
) %>%
arrange(max_gdp) %>% # which is consistent with the order in the box plot
knitr::kable()
| continent | max_gdp |
|---|---|
| Africa | 21951.21 |
| Oceania | 34435.37 |
| Americas | 42951.65 |
| Europe | 49357.19 |
| Asia | 113523.13 |
gapminder %>%
mutate(continent=fct_reorder(continent,gdpPercap,min)) %>%
ggplot(aes(continent,gdpPercap)) + geom_boxplot(aes(fill=continent))
# to verify that this is reordered using maximum GDP per capital of each continent, we do the following:
gapminder %>%
group_by(continent) %>%
# calculate the maximum gdpPercap in each continent
summarize(
min_gdp = min(gdpPercap)
) %>%
arrange(min_gdp) %>% # which is consistent with the order in the box plot
knitr::kable()
| continent | min_gdp |
|---|---|
| Africa | 241.1659 |
| Asia | 331.0000 |
| Europe | 973.5332 |
| Americas | 1201.6372 |
| Oceania | 10039.5956 |
gapminder %>%
mutate(continent=fct_reorder(continent,gdpPercap,mean)) %>%
ggplot(aes(continent,gdpPercap)) + geom_boxplot(aes(fill=continent))
gapminder %>%
group_by(continent) %>%
# calculate the maximum gdpPercap in each continent
summarize(
mean_gdp = mean(gdpPercap)
) %>%
arrange(mean_gdp) %>% # which is consistent with the order in the box plot
knitr::kable()
| continent | mean_gdp |
|---|---|
| Africa | 2193.755 |
| Americas | 7136.110 |
| Asia | 7902.150 |
| Europe | 14469.476 |
| Oceania | 18621.609 |
In order to explore whether this survives the round trip of writing to file then reading back in, we filter the gapminder dataframe and get only the data from Asian country in 2007.
filterdata <- gapminder %>%
filter(continent == "Asia" & year == 2007)
# drop the unused level
asiadata <- filterdata %>%
droplevels()
# check the level of continent and country to make sure the unused data is successfully dropped.
str(asiadata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 6 variables:
## $ country : Factor w/ 33 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 1 level "Asia": 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ lifeExp : num 43.8 75.6 64.1 59.7 73 ...
## $ pop : int 31889923 708573 150448339 14131858 1318683096 6980412 1110396331 223547000 69453570 27499638 ...
## $ gdpPercap: num 975 29796 1391 1714 4959 ...
write_csv()/read_csv()so now we will try to write the asiadata into the csv fie and again read it out to see whether the write and read has effected the data.
#first check the original data frame asiadata
head(asiadata)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bahrain Asia 2007 75.6 708573 29796.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 Cambodia Asia 2007 59.7 14131858 1714.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
# wirte the csv file
write_csv(asiadata,"asiadata.csv")
Now we found out that the variable country and continent which used to be factor transfer into a characters after reading back from the csv file. Apart from that, the data remains unchanged.
# read back the csv file
read_data <- read_csv("asiadata.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
is.factor(read_data$country) # which is false
## [1] FALSE
is.factor(read_data$continent)
## [1] FALSE
is.character(read_data$country)
## [1] TRUE
# now check the new read_in data
head(read_data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bahrain Asia 2007 75.6 708573 29796.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 Cambodia Asia 2007 59.7 14131858 1714.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
saveRDS()/readRDS()while saving and reading RDS file, the class of the data is preserved, therefore the variable continent and country returned are with class factor.
# save to RDS file
saveRDS(asiadata, "asiadata.rds")
# read from RDS file
read_rdsdata <- readRDS("asiadata.rds")
# check the readin data
head(read_rdsdata)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bahrain Asia 2007 75.6 708573 29796.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 Cambodia Asia 2007 59.7 14131858 1714.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
is.factor(read_rdsdata$country) # which is true
## [1] TRUE
is.factor(read_rdsdata$continent)
## [1] TRUE
dput()/dget()# put data into file
dput(asiadata, "asiadata.txt")
# get data from text file
data_txt <- dget("asiadata.txt")
head(data_txt)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bahrain Asia 2007 75.6 708573 29796.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 Cambodia Asia 2007 59.7 14131858 1714.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
is.factor(data_txt$country) # which is true
## [1] TRUE
is.factor(data_txt$continent)
## [1] TRUE
This is similar to saveRDS()/readRDS() which preserve the class of the data and the data remains unchanged while reading in from the txt file.
ggplot(gapminder,aes(gdpPercap,continent))+
geom_point(aes(colour=continent,size=gdpPercap),alpha=0.4)
Here is the modified graph:
# get the max, min, median and mean for each contient in different year
new_table <- gapminder %>%
group_by(continent,year) %>%
summarize(
min_gdp = min(min(gdpPercap)),
max_gdp = max(max(gdpPercap)),
mean_gdp = mean(mean(gdpPercap)),
median_gdp = median(median(gdpPercap))
)
# then we need to gather the gdp together to tidy up the data
tidy_table <- gather(new_table,key = "Stat_GDP", value="GDP_value", min_gdp, max_gdp,mean_gdp,median_gdp)
# then check the new gathered table
knitr::kable(head(tidy_table))
| continent | year | Stat_GDP | GDP_value |
|---|---|---|---|
| Africa | 1952 | min_gdp | 298.8462 |
| Africa | 1957 | min_gdp | 335.9971 |
| Africa | 1962 | min_gdp | 355.2032 |
| Africa | 1967 | min_gdp | 412.9775 |
| Africa | 1972 | min_gdp | 464.0995 |
| Africa | 1977 | min_gdp | 502.3197 |
# now the data is ready for plotting
new_graph <- tidy_table %>%
ggplot(aes(x = year, y = GDP_value, color = Stat_GDP) ) +
facet_wrap(~continent) +
#use log10 in y-axis
scale_y_log10()+
geom_point()+
#use line to show the trend
geom_line()+
ylab("GDP distribution")+
ggtitle("Summarized GDP per Capital vs Year")
new_graph
The differences between these two version are:
ggplot2.ggplotly(new_graph)